{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LAB 3b:  BigQuery ML Model Linear Feature Engineering/Transform.\n",
    "\n",
    "**Learning Objectives**\n",
    "\n",
    "1. Create and evaluate linear model with BigQuery's ML.FEATURE_CROSS\n",
    "1. Create and evaluate linear model with BigQuery's ML.FEATURE_CROSS and ML.BUCKETIZE\n",
    "1. Create and evaluate linear model with ML.TRANSFORM\n",
    "\n",
    "\n",
    "## Introduction \n",
    "In this notebook, we will create multiple linear models to predict the weight of a baby before it is born, using increasing levels of feature engineering using BigQuery ML. If you need a refresher, you can go back and look how we made a baseline model in the previous notebook [BQML Baseline Model](../solutions/3a_bqml_baseline_babyweight.ipynb).\n",
    "\n",
    "We will create and evaluate a linear model using BigQuery's ML.FEATURE_CROSS, create and evaluate a linear model using BigQuery's ML.FEATURE_CROSS and ML.BUCKETIZE, and create and evaluate a linear model using BigQuery's ML.TRANSFORM."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "hJ7ByvoXzpVI"
   },
   "source": [
    "## Load necessary libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "mC9K9Dpx1ztf"
   },
   "source": [
    "Check that the Google BigQuery library is installed and if not, install it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 609
    },
    "colab_type": "code",
    "id": "RZUQtASG10xO",
    "outputId": "5612d6b0-9730-476a-a28f-8fdc14f4ecde"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "google-cloud-bigquery==1.6.1\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "pip freeze | grep google-cloud-bigquery==1.6.1 || \\\n",
    "pip install google-cloud-bigquery==1.6.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "clnaaqQsXkwC"
   },
   "source": [
    "## Verify tables exist\n",
    "\n",
    "Run the following cells to verify that we previously created the dataset and data tables. If not, go back to lab [1b_prepare_data_babyweight](../solutions/1b_prepare_data_babyweight.ipynb) to create them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight_pounds</th>\n",
       "      <th>is_male</th>\n",
       "      <th>mother_age</th>\n",
       "      <th>plurality</th>\n",
       "      <th>gestation_weeks</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [weight_pounds, is_male, mother_age, plurality, gestation_weeks]\n",
       "Index: []"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "-- LIMIT 0 is a free query; this allows us to check that the table exists.\n",
    "SELECT * FROM babyweight.babyweight_data_train\n",
    "LIMIT 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight_pounds</th>\n",
       "      <th>is_male</th>\n",
       "      <th>mother_age</th>\n",
       "      <th>plurality</th>\n",
       "      <th>gestation_weeks</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [weight_pounds, is_male, mother_age, plurality, gestation_weeks]\n",
       "Index: []"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "-- LIMIT 0 is a free query; this allows us to check that the table exists.\n",
    "SELECT * FROM babyweight.babyweight_data_eval\n",
    "LIMIT 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "FbSRbuJ-fYtK"
   },
   "source": [
    "## Model 1:  Apply the ML.FEATURE_CROSS clause to categorical features\n",
    "\n",
    "BigQuery ML now has ML.FEATURE_CROSS, a pre-processing clause that performs a feature cross with syntax ML.FEATURE_CROSS(STRUCT(features), degree) where features are comma-separated categorical columns and degree is highest degree of all combinations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create model with feature cross."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "Z3U2FxVklrlU"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE MODEL\n",
    "    babyweight.model_1\n",
    "\n",
    "OPTIONS (\n",
    "    MODEL_TYPE=\"LINEAR_REG\",\n",
    "    INPUT_LABEL_COLS=[\"weight_pounds\"],\n",
    "    L2_REG=0.1,\n",
    "    DATA_SPLIT_METHOD=\"NO_SPLIT\") AS\n",
    "\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks,\n",
    "    ML.FEATURE_CROSS(\n",
    "        STRUCT(\n",
    "            is_male,\n",
    "            plurality)\n",
    "    ) AS gender_plurality_cross\n",
    "FROM\n",
    "    babyweight.babyweight_data_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "G6tpoYhcIgs4"
   },
   "source": [
    "#### Create two SQL statements to evaluate the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "t10fGqSfIgtA"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_absolute_error</th>\n",
       "      <th>mean_squared_error</th>\n",
       "      <th>mean_squared_log_error</th>\n",
       "      <th>median_absolute_error</th>\n",
       "      <th>r2_score</th>\n",
       "      <th>explained_variance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>0.829803</td>\n",
       "      <td>1.139565</td>\n",
       "      <td>0.02041</td>\n",
       "      <td>0.67478</td>\n",
       "      <td>0.345091</td>\n",
       "      <td>0.345093</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_absolute_error  mean_squared_error  mean_squared_log_error  \\\n",
       "0             0.829803            1.139565                 0.02041   \n",
       "\n",
       "   median_absolute_error  r2_score  explained_variance  \n",
       "0                0.67478  0.345091            0.345093  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    ML.EVALUATE(MODEL babyweight.model_1,\n",
    "    (\n",
    "    SELECT\n",
    "        weight_pounds,\n",
    "        is_male,\n",
    "        mother_age,\n",
    "        plurality,\n",
    "        gestation_weeks,\n",
    "        ML.FEATURE_CROSS(\n",
    "            STRUCT(\n",
    "                is_male,\n",
    "                plurality)\n",
    "        ) AS gender_plurality_cross\n",
    "    FROM\n",
    "        babyweight.babyweight_data_eval\n",
    "    ))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "uC-gyvAmIgtE"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1.067504</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       rmse\n",
       "0  1.067504"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "    SQRT(mean_squared_error) AS rmse\n",
    "FROM\n",
    "    ML.EVALUATE(MODEL babyweight.model_1,\n",
    "    (\n",
    "    SELECT\n",
    "        weight_pounds,\n",
    "        is_male,\n",
    "        mother_age,\n",
    "        plurality,\n",
    "        gestation_weeks,\n",
    "        ML.FEATURE_CROSS(\n",
    "            STRUCT(\n",
    "                is_male,\n",
    "                plurality)\n",
    "        ) AS gender_plurality_cross\n",
    "    FROM\n",
    "        babyweight.babyweight_data_eval\n",
    "    ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model 2:  Apply the BUCKETIZE Function \n",
    "\n",
    "Bucketize is a pre-processing function that creates \"buckets\" (e.g bins) - e.g. it bucketizes a continuous numerical feature into a string feature with bucket names as the value with syntax ML.BUCKETIZE(feature, split_points) with split_points being an array of numerical points to determine bucket bounds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE MODEL\n",
    "    babyweight.model_2\n",
    "\n",
    "OPTIONS (\n",
    "    MODEL_TYPE=\"LINEAR_REG\",\n",
    "    INPUT_LABEL_COLS=[\"weight_pounds\"],\n",
    "    L2_REG=0.1,\n",
    "    DATA_SPLIT_METHOD=\"NO_SPLIT\") AS\n",
    "\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks,\n",
    "    ML.FEATURE_CROSS(\n",
    "        STRUCT(\n",
    "            is_male,\n",
    "            ML.BUCKETIZE(\n",
    "                mother_age,\n",
    "                GENERATE_ARRAY(15, 45, 1)\n",
    "            ) AS bucketed_mothers_age,\n",
    "            plurality,\n",
    "            ML.BUCKETIZE(\n",
    "                gestation_weeks,\n",
    "                GENERATE_ARRAY(17, 47, 1)\n",
    "            ) AS bucketed_gestation_weeks\n",
    "        )\n",
    "    ) AS crossed\n",
    "FROM\n",
    "    babyweight.babyweight_data_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AVPXGKZ374v7"
   },
   "source": [
    "Let's now retrieve the training statistics and evaluate the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>training_run</th>\n",
       "      <th>iteration</th>\n",
       "      <th>loss</th>\n",
       "      <th>eval_loss</th>\n",
       "      <th>learning_rate</th>\n",
       "      <th>duration_ms</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1.056824</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>38752</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   training_run  iteration      loss eval_loss learning_rate  duration_ms\n",
       "0             0          0  1.056824      None          None        38752"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT * FROM ML.TRAINING_INFO(MODEL babyweight.model_2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now evaluate our model on our eval dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_absolute_error</th>\n",
       "      <th>mean_squared_error</th>\n",
       "      <th>mean_squared_log_error</th>\n",
       "      <th>median_absolute_error</th>\n",
       "      <th>r2_score</th>\n",
       "      <th>explained_variance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>0.798202</td>\n",
       "      <td>1.057588</td>\n",
       "      <td>0.017775</td>\n",
       "      <td>0.650903</td>\n",
       "      <td>0.392203</td>\n",
       "      <td>0.392205</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_absolute_error  mean_squared_error  mean_squared_log_error  \\\n",
       "0             0.798202            1.057588                0.017775   \n",
       "\n",
       "   median_absolute_error  r2_score  explained_variance  \n",
       "0               0.650903  0.392203            0.392205  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    ML.EVALUATE(MODEL babyweight.model_2,\n",
    "    (\n",
    "    SELECT\n",
    "        weight_pounds,\n",
    "        is_male,\n",
    "        mother_age,\n",
    "        plurality,\n",
    "        gestation_weeks,\n",
    "        ML.FEATURE_CROSS(\n",
    "            STRUCT(\n",
    "                is_male,\n",
    "                ML.BUCKETIZE(\n",
    "                    mother_age,\n",
    "                    GENERATE_ARRAY(15, 45, 1)\n",
    "                ) AS bucketed_mothers_age,\n",
    "                plurality,\n",
    "                ML.BUCKETIZE(\n",
    "                    gestation_weeks,\n",
    "                    GENERATE_ARRAY(17, 47, 1)\n",
    "                ) AS bucketed_gestation_weeks\n",
    "            )\n",
    "        ) AS crossed\n",
    "    FROM\n",
    "        babyweight.babyweight_data_eval))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's select the `mean_squared_error` from the evaluation table we just computed and square it to obtain the rmse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1.028391</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       rmse\n",
       "0  1.028391"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "    SQRT(mean_squared_error) AS rmse\n",
    "FROM\n",
    "    ML.EVALUATE(MODEL babyweight.model_2,\n",
    "    (\n",
    "    SELECT\n",
    "        weight_pounds,\n",
    "        is_male,\n",
    "        mother_age,\n",
    "        plurality,\n",
    "        gestation_weeks,\n",
    "        ML.FEATURE_CROSS(\n",
    "            STRUCT(\n",
    "                is_male,\n",
    "                ML.BUCKETIZE(\n",
    "                    mother_age,\n",
    "                    GENERATE_ARRAY(15, 45, 1)\n",
    "                ) AS bucketed_mothers_age,\n",
    "                plurality,\n",
    "                ML.BUCKETIZE(\n",
    "                    gestation_weeks,\n",
    "                    GENERATE_ARRAY(17, 47, 1)\n",
    "                ) AS bucketed_gestation_weeks\n",
    "            )\n",
    "        ) AS crossed\n",
    "    FROM\n",
    "        babyweight.babyweight_data_eval))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model 3:  Apply the TRANSFORM clause\n",
    "\n",
    "Before we perform our prediction, we should encapsulate the entire feature set in a TRANSFORM clause. This way we can have the same transformations applied for training and prediction without modifying the queries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's apply the TRANSFORM clause to the model_3 and run the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE MODEL\n",
    "    babyweight.model_3\n",
    "\n",
    "TRANSFORM(\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks,\n",
    "    ML.FEATURE_CROSS(\n",
    "        STRUCT(\n",
    "            is_male,\n",
    "            ML.BUCKETIZE(\n",
    "                mother_age,\n",
    "                GENERATE_ARRAY(15, 45, 1)\n",
    "            ) AS bucketed_mothers_age,\n",
    "            plurality,\n",
    "            ML.BUCKETIZE(\n",
    "                gestation_weeks,\n",
    "                GENERATE_ARRAY(17, 47, 1)\n",
    "            ) AS bucketed_gestation_weeks\n",
    "        )\n",
    "    ) AS crossed\n",
    ")\n",
    "\n",
    "OPTIONS (\n",
    "    MODEL_TYPE=\"LINEAR_REG\",\n",
    "    INPUT_LABEL_COLS=[\"weight_pounds\"],\n",
    "    L2_REG=0.1,\n",
    "    DATA_SPLIT_METHOD=\"NO_SPLIT\") AS\n",
    "\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    babyweight.babyweight_data_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's retrieve the training statistics:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>training_run</th>\n",
       "      <th>iteration</th>\n",
       "      <th>loss</th>\n",
       "      <th>eval_loss</th>\n",
       "      <th>learning_rate</th>\n",
       "      <th>duration_ms</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1.056824</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>40302</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   training_run  iteration      loss eval_loss learning_rate  duration_ms\n",
       "0             0          0  1.056824      None          None        40302"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT * FROM ML.TRAINING_INFO(MODEL babyweight.model_3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now evaluate our model on our eval dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_absolute_error</th>\n",
       "      <th>mean_squared_error</th>\n",
       "      <th>mean_squared_log_error</th>\n",
       "      <th>median_absolute_error</th>\n",
       "      <th>r2_score</th>\n",
       "      <th>explained_variance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>0.798202</td>\n",
       "      <td>1.057588</td>\n",
       "      <td>0.017775</td>\n",
       "      <td>0.650902</td>\n",
       "      <td>0.392203</td>\n",
       "      <td>0.392205</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_absolute_error  mean_squared_error  mean_squared_log_error  \\\n",
       "0             0.798202            1.057588                0.017775   \n",
       "\n",
       "   median_absolute_error  r2_score  explained_variance  \n",
       "0               0.650902  0.392203            0.392205  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    ML.EVALUATE(MODEL babyweight.model_3,\n",
    "    (\n",
    "    SELECT\n",
    "        *\n",
    "    FROM\n",
    "        babyweight.babyweight_data_eval\n",
    "    ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's select the `mean_squared_error` from the evaluation table we just computed and square it to obtain the rmse."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1.028391</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       rmse\n",
       "0  1.028391"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "    SQRT(mean_squared_error) AS rmse\n",
    "FROM\n",
    "    ML.EVALUATE(MODEL babyweight.model_3,\n",
    "    (\n",
    "    SELECT\n",
    "        *\n",
    "    FROM\n",
    "        babyweight.babyweight_data_eval\n",
    "    ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lab Summary: \n",
    "In this lab, we created and evaluated a linear model using BigQuery's ML.FEATURE_CROSS, created and evaluated a linear model using BigQuery's ML.FEATURE_CROSS and ML.BUCKETIZE, and created and evaluated a linear model using BigQuery's ML.TRANSFORM."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "10.3.2019-DRAFT-_1 - FeatEnG - LABS.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
